The Paris Climate Agreement recognizes forests as a key part of the solution to the climate change challenge. Better land stewardship may provide 37% of the cost-effective climate change mitigation needed by 2030 to keep global warming below 2 °C, with reforestation, avoided deforestation and natural forest management contributing nearly two-thirds of this potential [1]. REDD+ (reducing emissions from deforestation and forest degradation, and enhancing carbon stocks) was introduced at the UNFCCC Conference of the Parties in Bali in 2007, and as of July 2022 more than 400 projects are registered with 'ongoing' status across countries. In this project exercise, I focus on building a summary overview of REDD+ projects, developing some hypotheses and testing them for statistical significance, and reporting the results in a Tableau dashboard.
The dataset was downloaded from the International Database on REDD+ projects and programs (IDRECCO), with updates as of 22 July 2022. The purpose of this exercise is to:
The dataset is structured as 10 Excel sheets, each containing information on a specific area such as Project, Carbon Certification, Financing Source, Community Level Intervention and Host Country. So let's start.
import pandas as pd
import csv
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns
import numpy as np
import scipy.stats as stats
import statsmodels.api as sm
from sklearn.preprocessing import LabelEncoder
import copy
sns.set() #setting the default seaborn style for our plots
data_original = pd.read_excel(r'C:\Users\Nomuun\Desktop\Env Econ\REDD.xlsx', sheet_name = "1. Project")
data_original.columns = [x.lower() for x in data_original.columns]
data_original.head(3)
| | project id | project name | secondary name | last idrecco update (yyyymmdd) | size (in hectare) | size of crediting area (in hectare) | start year | end year | duration | project description | ... | data quality - carbon transacations | data quality - financing sources | data quality - community interventions | status | longitude (decimal degrees) | latitude (decimal degrees) | multiple locations? | region | jurisdiction level 1 | jurisdiction level 2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100 | PRC | Commercial reforestation on lands dedicated to... | 2020-09-04 | 3137.0 | 9999.0 | 2000 | 2030 | 30 | The proposed A/R CDM project activity consists... | ... | good data | good data | good data | Ongoing | -74.647083 | 9.930694 | No | South America | Department : Magdalena | Municipality : El Banco |
| 1 | 101 | Sierra Gorda Premium Carbon: Carbon Sequestrat... | Carbon Sequestration in Communities of Extreme... | 2020-09-04 | 247.0 | 247.0 | 1997 | 2042 | 46 | Bosque Sustentable A.C. is working with privat... | ... | good data | good data | good data | Ongoing | 9999.999999 | 9999.999999 | Non | South America | State : Querétaro and State : San Luis Potosí | Municipality : Pinal de Amoles, Jalpan de Serr... |
| 2 | 102 | Scolel 'te | Scolel té Natural Resources Management and Car... | 2020-09-04 | 9049.0 | 7662.0 | 1997 | 2027 | 30 | Scolel Té is a project that assists farmers an... | ... | good data | good data | good data | Ongoing | -90.680747 | 16.336757 | Yes | South America | State : Chiapas and State : Oaxaca | Municipality : Tuxtla Gutiérrez |
3 rows × 41 columns
data_original.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 624 entries, 0 to 623
Data columns (total 41 columns):
 #  Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0  project id  624 non-null  int64
 1  project name  624 non-null  object
 2  secondary name  261 non-null  object
 3  last idrecco update (yyyymmdd)  624 non-null  datetime64[ns]
 4  size (in hectare)  624 non-null  float64
 5  size of crediting area (in hectare)  624 non-null  float64
 6  start year  624 non-null  int64
 7  end year  624 non-null  int64
 8  duration  624 non-null  int64
 9  project description  620 non-null  object
 10  objective 1  624 non-null  object
 11  objective 2  624 non-null  object
 12  objective 3  623 non-null  object
 13  deforestation drivers  624 non-null  object
 14  type of forest  624 non-null  object
 15  project type  624 non-null  object
 16  details for afforestation/reforestation activity  324 non-null  object
 17  project located in an iucn protected area?  624 non-null  object
 18  dominant type  624 non-null  object
 19  was fpic used?  624 non-null  object
 20  was participatory approach used?  624 non-null  object
 21  id_country  624 non-null  int64
 22  location description  595 non-null  object
 23  project partners  448 non-null  object
 24  information sources  623 non-null  object
 25  name of the protected area  505 non-null  object
 26  size of the protected area (hectares)  505 non-null  float64
 27  estimated proportion of the project located in a protected area  505 non-null  object
 28  category of protected area (iucn classification)  504 non-null  object
 29  type of community participation  624 non-null  object
 30  data quality - carbon certification  624 non-null  object
 31  data quality - carbon transacations  624 non-null  object
 32  data quality - financing sources  624 non-null  object
 33  data quality - community interventions  624 non-null  object
 34  status  624 non-null  object
 35  longitude (decimal degrees)  624 non-null  float64
 36  latitude (decimal degrees)  624 non-null  float64
 37  multiple locations?  624 non-null  object
 38  region  624 non-null  object
 39  jurisdiction level 1  604 non-null  object
 40  jurisdiction level 2  571 non-null  object
dtypes: datetime64[ns](1), float64(5), int64(5), object(30)
memory usage: 200.0+ KB
data_original.isnull().mean()
project id  0.000000
project name  0.000000
secondary name  0.581731
last idrecco update (yyyymmdd)  0.000000
size (in hectare)  0.000000
size of crediting area (in hectare)  0.000000
start year  0.000000
end year  0.000000
duration  0.000000
project description  0.006410
objective 1  0.000000
objective 2  0.000000
objective 3  0.001603
deforestation drivers  0.000000
type of forest  0.000000
project type  0.000000
details for afforestation/reforestation activity  0.480769
project located in an iucn protected area?  0.000000
dominant type  0.000000
was fpic used?  0.000000
was participatory approach used?  0.000000
id_country  0.000000
location description  0.046474
project partners  0.282051
information sources  0.001603
name of the protected area  0.190705
size of the protected area (hectares)  0.190705
estimated proportion of the project located in a protected area  0.190705
category of protected area (iucn classification)  0.192308
type of community participation  0.000000
data quality - carbon certification  0.000000
data quality - carbon transacations  0.000000
data quality - financing sources  0.000000
data quality - community interventions  0.000000
status  0.000000
longitude (decimal degrees)  0.000000
latitude (decimal degrees)  0.000000
multiple locations?  0.000000
region  0.000000
jurisdiction level 1  0.032051
jurisdiction level 2  0.084936
dtype: float64
sns.heatmap(data_original.isnull(), cbar=False)
<AxesSubplot:>
As the heatmap shows, the following attributes have missing data:
- Secondary name: ~58% missing
- Details for afforestation/reforestation activity: ~48%
- Project partners: ~28%
- Size of the protected area (hectares), Estimated proportion of the project located in a protected area and Category of protected area (IUCN classification): ~19% each
- Jurisdiction level 2: ~8%
- Project description, Objective 3, Location description and Jurisdiction level 1: less than 5% each
As for data types, all attributes seem to have been recognized correctly. The column names do not follow Python's preferred snake case, which would cause issues later in the analysis, so I rename the columns after dropping the ones that are not used.
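The missing-value shares listed above can also be read off a small sorted table instead of the heatmap. The following is a minimal sketch on a made-up toy frame; `missing_summary` and the toy values are hypothetical helpers for illustration, not part of the original notebook.

```python
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return count and percentage of missing values per column, highest first."""
    summary = pd.DataFrame({
        "n_missing": df.isnull().sum(),
        "pct_missing": (df.isnull().mean() * 100).round(1),
    })
    # Keep only the columns that actually have gaps
    summary = summary[summary["n_missing"] > 0]
    return summary.sort_values("pct_missing", ascending=False)

# Toy frame standing in for data_original (values are made up)
toy = pd.DataFrame({
    "project id": [100, 101, 102, 103],
    "secondary name": ["a", None, None, None],
    "project partners": ["x", "y", None, "z"],
})
report = missing_summary(toy)
print(report)
```

Calling `missing_summary(data_original)` should give the same percentages as `data_original.isnull().mean()`, restricted to the columns with gaps.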
data = data_original.drop(['last idrecco update (yyyymmdd)'
,'longitude (decimal degrees)'
,'latitude (decimal degrees)'
,'jurisdiction level 1'
,'jurisdiction level 2'
,'type of community participation'
,'multiple locations?'
,'category of protected area (iucn classification)'
,'data quality - financing sources'
,'name of the protected area'
,'information sources'
,'location description'
,'project partners'
,'type of forest'
,'was fpic used?'
,'dominant type'
,'data quality - carbon certification'
,'data quality - carbon transacations'
,'data quality - community interventions'
,'project located in an iucn protected area?'
,'was participatory approach used?'
,'details for afforestation/reforestation activity'
]
,axis = 1)
data = data.rename(columns =
{'project id':'project_id'
, 'project name': 'project_name'
, 'secondary name': 'secondary_name'
, 'size (in hectare)': 'size_in_hectare'
, 'size of crediting area (in hectare)': 'size_of_crediting_area_in_hectare'
, 'start year': 'start_year'
, 'end year': 'end_year'
, 'project description' : 'project_description'
, 'objective 1' : 'objective_1'
, 'objective 2' : 'objective_2'
, 'objective 3' : 'objective_3'
, 'deforestation drivers' : 'deforestation_drivers'
, 'project type' : 'project_type'
, 'size of the protected area (hectares)' : 'size_of_pa_hectares'
, 'estimated proportion of the project located in a protected area': 'estimated_proportion_of_project_located_in_pa'
})
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 624 entries, 0 to 623
Data columns (total 19 columns):
 #  Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0  project_id  624 non-null  int64
 1  project_name  624 non-null  object
 2  secondary_name  261 non-null  object
 3  size_in_hectare  624 non-null  float64
 4  size_of_crediting_area_in_hectare  624 non-null  float64
 5  start_year  624 non-null  int64
 6  end_year  624 non-null  int64
 7  duration  624 non-null  int64
 8  project_description  620 non-null  object
 9  objective_1  624 non-null  object
 10  objective_2  624 non-null  object
 11  objective_3  623 non-null  object
 12  deforestation_drivers  624 non-null  object
 13  project_type  624 non-null  object
 14  id_country  624 non-null  int64
 15  size_of_pa_hectares  505 non-null  float64
 16  estimated_proportion_of_project_located_in_pa  505 non-null  object
 17  status  624 non-null  object
 18  region  624 non-null  object
dtypes: float64(3), int64(5), object(11)
memory usage: 92.8+ KB
df_c = pd.read_excel(r'C:\Users\Nomuun\Desktop\Env Econ\REDD.xlsx', sheet_name = "9. Country")
df_c.columns = [x.lower() for x in df_c.columns]
df_c.columns
Index(['id_country', 'country name', 'human development index (2019)',
'gdp (billion usd, 2019)', 'gdp per capita (usd/person, 2019)',
'population (2019)', 'forest area (2020, ha)', 'forest loss (2020)',
'annual deforestation rate 2015-2020 (%)',
'index of government effectiveness (2018)',
'index of corruption control (2018)',
'participation in global redd+ programs', 'forest tenure (2015)',
'comment'],
dtype='object')
df_c = df_c.rename(columns =
{'country name':'country_name'
, 'human development index (2019)': 'hdi_2019'
, 'gdp (billion usd, 2019)': 'gdp_bln_usd_2019'
, 'gdp per capita (usd/person, 2019)': 'gdp_per_capita_usd_2019'
, 'forest area (2020, ha)': 'forest_area_2020_ha'
, 'forest loss (2020)' : 'forest_loss_2020'
, 'annual deforestation rate 2015-2020 (%)': 'annual_deforestation_rate_2015_2020'
, 'index of government effectiveness (2018)' : 'index_of_government_effectiveness_2018'
, 'index of corruption control (2018)' : 'index_of_corruption_control_2018'
, 'participation in global redd+ programs' : 'participation_in_global_redd+_programs'
, 'forest tenure (2015)' : 'forest_tenure_2015'
, 'deforestation drivers' : 'deforestation_drivers'
, 'project type' : 'project_type'
})
merged_c = pd.merge(left = data, right = df_c, on = 'id_country')
merged_c.head(3)
| | project_id | project_name | secondary_name | size_in_hectare | size_of_crediting_area_in_hectare | start_year | end_year | duration | project_description | objective_1 | ... | gdp_per_capita_usd_2019 | population (2019) | forest_area_2020_ha | forest_loss_2020 | annual_deforestation_rate_2015_2020 | index_of_government_effectiveness_2018 | index_of_corruption_control_2018 | participation_in_global_redd+_programs | forest_tenure_2015 | comment |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100 | PRC | Commercial reforestation on lands dedicated to... | 3137.0 | 9999.0 | 2000 | 2030 | 30 | The proposed A/R CDM project activity consists... | biodiversity conservation | ... | 6432.4 | 50.33944 | 59141.91 | -198.55 | -0.332378 | -0.085226 | -0.301493 | FCPF|UNREDD|BioCarbon Fund Initiative for Sust... | 65.96% public; 30.42% private; 3.61% unknown. | NaN |
| 1 | 133 | Pachamama | NaN | 1980.0 | 1980.0 | 2009 | 2034 | 25 | Pachamama Forest is developing four different ... | biodiversity conservation | ... | 6432.4 | 50.33944 | 59141.91 | -198.55 | -0.332378 | -0.085226 | -0.301493 | FCPF|UNREDD|BioCarbon Fund Initiative for Sust... | 65.96% public; 30.42% private; 3.61% unknown. | NaN |
| 2 | 134 | San Nicolas Carbon Sequestration Project | San Nicolas CDM Reforestation Project | 1101.0 | 1101.0 | 2008 | 2027 | 20 | The project development objective is to pionee... | development;social development | ... | 6432.4 | 50.33944 | 59141.91 | -198.55 | -0.332378 | -0.085226 | -0.301493 | FCPF|UNREDD|BioCarbon Fund Initiative for Sust... | 65.96% public; 30.42% private; 3.61% unknown. | NaN |
3 rows × 32 columns
merged_c.describe().T
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| project_id | 624.0 | 4.553237e+02 | 2.225397e+02 | 100.000000 | 266.750000 | 430.500000 | 622.250000 | 8.840000e+02 |
| size_in_hectare | 624.0 | 1.730173e+06 | 1.329796e+07 | 4.000000 | 2000.000000 | 10000.000000 | 87835.000000 | 2.377650e+08 |
| size_of_crediting_area_in_hectare | 624.0 | 2.520304e+05 | 1.975679e+06 | 5.000000 | 5345.750000 | 9999.000000 | 9999.000000 | 3.165530e+07 |
| start_year | 624.0 | 2.752077e+03 | 2.321714e+03 | 1979.000000 | 2008.000000 | 2011.000000 | 2014.000000 | 9.999000e+03 |
| end_year | 624.0 | 3.787580e+03 | 3.297166e+03 | 2006.000000 | 2032.000000 | 2042.000000 | 2075.500000 | 9.999000e+03 |
| duration | 624.0 | 2.155784e+03 | 4.085369e+03 | 1.000000 | 25.000000 | 30.000000 | 60.000000 | 9.999000e+03 |
| id_country | 624.0 | 3.585321e+02 | 2.386287e+02 | 24.000000 | 156.000000 | 356.000000 | 550.500000 | 8.940000e+02 |
| size_of_pa_hectares | 505.0 | 8.039826e+04 | 3.806587e+05 | 5.000000 | 9999.000000 | 9999.000000 | 9999.000000 | 4.268100e+06 |
| hdi_2019 | 624.0 | 6.750801e-01 | 1.080469e-01 | 0.377000 | 0.579000 | 0.707000 | 0.761000 | 8.470000e-01 |
| gdp_bln_usd_2019 | 624.0 | 1.682586e+03 | 3.740241e+03 | 0.283990 | 47.319620 | 272.119700 | 1258.286720 | 1.434290e+04 |
| gdp_per_capita_usd_2019 | 624.0 | 5.329442e+03 | 3.839604e+03 | 411.600000 | 1816.500000 | 4620.000000 | 8717.200000 | 1.727650e+04 |
| population (2019) | 624.0 | 2.471158e+02 | 4.325955e+02 | 0.299880 | 28.585490 | 52.573970 | 211.049530 | 1.397715e+03 |
| forest_area_2020_ha | 624.0 | 1.117752e+05 | 1.554321e+05 | 44.980000 | 12429.810000 | 59141.910000 | 92133.200000 | 4.966196e+05 |
| forest_loss_2020 | 624.0 | -1.668631e+02 | 7.812104e+02 | -1453.040000 | -469.000000 | -127.766000 | 0.000000 | 1.936786e+03 |
| annual_deforestation_rate_2015_2020 | 624.0 | -2.605186e-01 | 7.045008e-01 | -3.564301 | -0.616794 | -0.290045 | 0.000000 | 1.130403e+00 |
| index_of_government_effectiveness_2018 | 624.0 | -2.655185e-01 | 5.072811e-01 | -1.581517 | -0.578343 | -0.245431 | 0.179875 | 1.084034e+00 |
| index_of_corruption_control_2018 | 624.0 | -4.824229e-01 | 4.479420e-01 | -1.503398 | -0.822813 | -0.419696 | -0.271244 | 1.265534e+00 |
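In the summary above, start_year, end_year and duration all have a maximum of 9999, and the median of size_of_crediting_area_in_hectare and size_of_pa_hectares is exactly 9999. That pattern suggests 9999 may be a placeholder for "unknown" rather than a real value. This is only an assumption and has not been checked against the IDRECCO documentation. Under that assumption, a minimal sketch for masking the placeholder before computing statistics could look like this (`mask_sentinel` and the toy frame are hypothetical):

```python
import numpy as np
import pandas as pd

SENTINEL = 9999  # assumed placeholder for "unknown"; not confirmed by the data source

def mask_sentinel(df, columns, sentinel=SENTINEL):
    """Return a copy of df with `sentinel` replaced by NaN in the given columns."""
    out = df.copy()
    out[columns] = out[columns].replace(sentinel, np.nan)
    return out

# Toy frame with made-up rows that mimic the pattern seen in describe()
toy = pd.DataFrame({
    "start_year": [2000, 9999, 2011],
    "duration": [30, 9999, 9999],
    "size_in_hectare": [3137.0, 9999.0, 247.0],  # left untouched on purpose
})
cleaned = mask_sentinel(toy, ["start_year", "duration"])
print(cleaned.describe().T[["count", "mean", "max"]])
```

For area columns, 9999 could also be a genuine value, so they are worth checking case by case before masking.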
merged_c['count'] = 1
merged_c.groupby(['status'])['count'].sum()
status
Abandoned  22
Cannot be confirmed  65
Ended  108
Ongoing  416
Planned  2
Temporarily paused  1
Terminated ahead of schedule  10
Name: count, dtype: int64
merged_c.groupby(['status','region']).status.count()
status region
Abandoned Africa 5
Asia 9
Oceania 1
South America 7
Cannot be confirmed Africa 16
Asia 14
South America 35
Ended Africa 34
Asia 38
Oceania 3
South America 33
Ongoing Africa 107
Asia 112
Oceania 5
South America 192
Planned Asia 1
South America 1
Temporarily paused Africa 1
Terminated ahead of schedule Africa 6
Asia 2
South America 2
Name: status, dtype: int64
df_ongoing = merged_c[merged_c.status == 'Ongoing'].groupby(['region','country_name','status']).agg({'count':'sum'}).sort_values(by = 'count', ascending = False).reset_index()
plt.figure(figsize = (8, 10))
plt.subplot(6,1,1)
sns.boxplot(x=merged_c.hdi_2019, color = 'lightblue')
plt.subplot(6,1,2)
sns.boxplot(x=merged_c.gdp_per_capita_usd_2019, color = 'red')
plt.subplot(6,1,3)
sns.boxplot(x=merged_c.size_in_hectare, color = 'lightblue')
plt.subplot(6,1,4)
sns.boxplot(x=merged_c.forest_loss_2020, color = 'blue')
plt.subplot(6,1,5)
sns.boxplot(x=merged_c.annual_deforestation_rate_2015_2020, color = 'lightblue')
plt.tight_layout()
forest_loss_2020 and annual_deforestation_rate_2015_2020 show a few outliers that are worth exploring further. Note: since the dataset is project-focused, country-level information such as country name, human development index or forest loss is the same across all projects from one country.
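Because country-level attributes repeat for every project, statistics computed on the project table weight each country by its number of projects. A minimal sketch of collapsing to one row per country first is shown below; `country_level` is a hypothetical helper, and the Peru value in the toy frame is made up for illustration.

```python
import pandas as pd

def country_level(df, country_col, value_cols):
    """Keep one row per country for attributes that repeat across its projects."""
    deduped = df.drop_duplicates(subset=country_col)
    return deduped[[country_col] + value_cols].reset_index(drop=True)

# Toy project table: each row is a project, forest_loss_2020 repeats per country
toy = pd.DataFrame({
    "country_name": ["Brazil", "Brazil", "Peru", "China", "China", "China"],
    "forest_loss_2020": [-1453.04, -1453.04, -100.0, 1936.786, 1936.786, 1936.786],
})
countries = country_level(toy, "country_name", ["forest_loss_2020"])
print(countries)
print("project-weighted mean:", toy["forest_loss_2020"].mean())
print("country-level mean:   ", countries["forest_loss_2020"].mean())
```

The two means differ because the project-weighted version counts China three times and Brazil twice.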
fig = px.bar(df_ongoing, x = 'country_name', y = 'count', color = 'region'
, template="simple_white")
fig.update_layout(title_text = 'Number of REDD+ projects by Country')
fig.show()
# Extract the top values per country (keeps each country's largest value of the column)
def top_bycountries(df, col_n, top_n = 5, rev = True):
    table = {}
    for i, row in df.iterrows():
        country = row.iloc[19]  # position 19 = country_name
        value = row.iloc[col_n]
        # Membership check instead of truthiness, so a stored value of 0 is not overwritten
        if country not in table or table[country] < value:
            table[country] = value
    # rev=True sorts descending, rev=False ascending
    sorted_table = sorted(table.items(), key = lambda x: x[1], reverse = rev)
    for entry in sorted_table[:top_n]:
        print(entry[0], ": ", entry[1])
# Extract the top values per project
def top_(df, col_i, top_n = 5, rev = True):
    table = []
    for i, row in df.iterrows():
        country = row.iloc[19]  # position 19 = country_name
        project = row.iloc[1]   # position 1 = project_name
        hectar = row.iloc[col_i]
        table.append((hectar, country, project))
    # Tuples sort by value first, then country, then project name
    table_sorted = sorted(table, reverse = rev)
    for value, country, project in table_sorted[:top_n]:
        print(f"{country} : {value} - {project}")
#Need to look up each column header's index, so it can be used for the functions created
name = merged_c.columns.to_list()
for i, row in enumerate(name):
    print(i, row)
0 project_id
1 project_name
2 secondary_name
3 size_in_hectare
4 size_of_crediting_area_in_hectare
5 start_year
6 end_year
7 duration
8 project_description
9 objective_1
10 objective_2
11 objective_3
12 deforestation_drivers
13 project_type
14 id_country
15 size_of_pa_hectares
16 estimated_proportion_of_project_located_in_pa
17 status
18 region
19 country_name
20 hdi_2019
21 gdp_bln_usd_2019
22 gdp_per_capita_usd_2019
23 population (2019)
24 forest_area_2020_ha
25 forest_loss_2020
26 annual_deforestation_rate_2015_2020
27 index_of_government_effectiveness_2018
28 index_of_corruption_control_2018
29 participation_in_global_redd+_programs
30 forest_tenure_2015
31 comment
32 count
Next, I look up the top values for size_in_hectare, forest_loss_2020 and annual_deforestation_rate_2015_2020, as these attributes are closely related to the forest condition of a project or country.
#Top 5 projects by size_in_hectare
top_(merged_c, 3, 5)
Brazil : 237765000.0 - Jurisdictional program of the State of Rondônia in Brazil
Brazil : 155915000.0 - Jurisdictional program of the State of Amazonas in Brazil
Brazil : 124796000.0 - Jurisdictional program of the State of Pará in Brazil
Brazil : 90337800.0 - Jurisdictional Program of the State of Mato Grosso in Brazil
Peru : 36877300.0 - Jurisdictional program of the Region of Loreto in Peru
#Top 5 countries by size_in_hectare
top_bycountries(merged_c, 3, 5)
Brazil : 237765000.0
Peru : 36877300.0
Indonesia : 31655300.0
Congo, the Democratic Republic of the : 20055900.0
Mozambique : 10500800.0
#Top 5 countries by forest_loss_2020
top_bycountries(merged_c, 25, rev = False)
# forest_loss_2020 is country-level information duplicated for each project,
# so I only look at it by country
Brazil : -1453.04
Congo, the Democratic Republic of the : -1101.376
Indonesia : -578.939999999999
Angola : -555.062
Tanzania, United Republic of : -469.0
#Top 5 countries reversing forest loss in 2020 (highest forest_loss_2020 values)
top_bycountries(merged_c, 25)
China : 1936.786
India : 266.4
Chile : 122.924
Viet Nam : 116.246
Philippines : 34.8880000000001
#Top 10 countries by the lowest annual_deforestation_rate_2015_2020
top_bycountries(merged_c, 26,10, rev = False)
Côte d'Ivoire : -3.56430144450434
Nicaragua : -2.70120239071242
Cambodia : -1.82526832300998
Malawi : -1.77500102488705
Uganda : -1.67673262256369
Paraguay : -1.64985182757532
Egypt : -1.463091361
Niger : -1.11222334352137
Tanzania, United Republic of : -0.994853448065125
Myanmar : -0.985164093818602
#Top 10 countries by the highest annual_deforestation_rate_2015_2020
top_bycountries(merged_c, 26, 10)
Uruguay : 1.13040324512219
China : 0.904478289907962
Viet Nam : 0.81333744346328
Chile : 0.689026609441834
Fiji : 0.596481536852722
Costa Rica : 0.548233894235706
Kenya : 0.498523542120721
Philippines : 0.492519097736999
Rwanda : 0.44054569626728
India : 0.373324586928514
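The lookups above rely on positional column indices (3, 25, 26), which silently point at the wrong column if the column order changes. As a possible alternative, here is a minimal sketch that does the per-country ranking by column name with `groupby`. `top_by_country` is a hypothetical helper, not the function used in this notebook; the toy rows reuse four size_in_hectare values from the outputs above.

```python
import pandas as pd

def top_by_country(df, value_col, n=5, largest=True, country_col="country_name"):
    """Rank countries by their largest value of value_col and return the first n."""
    per_country = df.groupby(country_col)[value_col].max()
    return per_country.nlargest(n) if largest else per_country.nsmallest(n)

# Toy project table (a small subset of the values printed above)
toy = pd.DataFrame({
    "country_name": ["Brazil", "Brazil", "Peru", "Indonesia"],
    "size_in_hectare": [237765000.0, 155915000.0, 36877300.0, 31655300.0],
})
top2 = top_by_country(toy, "size_in_hectare", n=2)
smallest = top_by_country(toy, "size_in_hectare", n=1, largest=False)
print(top2)
print(smallest)
```

Returning a Series rather than printing keeps the result available for plotting or merging later.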
Summary:
Based on what I learned in the sections above, I developed the following hypotheses to test. Of course, more questions could be examined, but the number was limited for the sake of the report's length. Hypotheses:
#lets slice our dataset for further analysis
table = merged_c.groupby(['region', 'country_name','gdp_per_capita_usd_2019', 'forest_area_2020_ha'])['count'].sum().sort_values(ascending = False).reset_index()
table
| | region | country_name | gdp_per_capita_usd_2019 | forest_area_2020_ha | count |
|---|---|---|---|---|---|
| 0 | South America | Brazil | 8717.200 | 496619.60 | 77 |
| 1 | South America | Colombia | 6432.400 | 59141.91 | 57 |
| 2 | Asia | Indonesia | 4135.600 | 92133.20 | 54 |
| 3 | Asia | China | 10261.700 | 219978.18 | 48 |
| 4 | South America | Peru | 6977.700 | 72330.37 | 37 |
| ... | ... | ... | ... | ... | ... |
| 59 | Africa | Liberia | 621.900 | 7617.44 | 1 |
| 60 | Africa | Guinea-Bissau | 697.800 | 1980.01 | 1 |
| 61 | Africa | Egypt | 3020.000 | 44.98 | 1 |
| 62 | Africa | Congo | 2011.100 | 21946.00 | 1 |
| 63 | South America | Venezuela, Bolivarian Republic of | 3410.845 | 46230.90 | 1 |
64 rows × 5 columns
fig = px.scatter(table, y = 'count', x = 'gdp_per_capita_usd_2019', color = 'region'
, trendline = 'ols', marginal_y="violin", marginal_x="box", template="simple_white"
, hover_name = 'country_name', size = 'forest_area_2020_ha')
fig.show()
df_plot = merged_c.drop(['project_id', 'project_name', 'secondary_name','start_year'
, 'end_year','project_description', 'objective_1', 'objective_2'
,'objective_3', 'deforestation_drivers', 'project_type'
, 'id_country', 'participation_in_global_redd+_programs'
, 'forest_tenure_2015', 'comment', 'count', 'country_name'
, 'size_of_crediting_area_in_hectare']
, axis = 1)
df_plot.columns
Index(['size_in_hectare', 'duration', 'size_of_pa_hectares',
'estimated_proportion_of_project_located_in_pa', 'status', 'region',
'hdi_2019', 'gdp_bln_usd_2019', 'gdp_per_capita_usd_2019',
'population (2019)', 'forest_area_2020_ha', 'forest_loss_2020',
'annual_deforestation_rate_2015_2020',
'index_of_government_effectiveness_2018',
'index_of_corruption_control_2018'],
dtype='object')
plt.figure(figsize = (10,8))
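# Correlate numeric columns only; status, region and the estimated proportion are text columns
heatmap = sns.heatmap(df_plot.select_dtypes('number').corr(), annot = True, vmin = -1, vmax = 1, cmap='coolwarm')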
heatmap.set_title('Correlation Heatmap', fontdict = {'fontsize':14}, pad = 12)
Text(0.5, 1.0, 'Correlation Heatmap')
From my data sample of ongoing REDD+ projects, a few interesting high correlations were identified for further analysis:
Intuitively, it makes sense that a country with a high GDP tends to have more economic activity such as agriculture and manufacturing. And since GDP has a stronger correlation with population, the index of government effectiveness and corruption control, all of those indirect correlations are reflected in the heatmap.
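Reading the strongest correlations off a heatmap by eye is error-prone when there are many columns. A minimal sketch for listing the strongest pairs programmatically follows; `strongest_pairs` and the toy data are hypothetical and only illustrate the idea.

```python
import numpy as np
import pandas as pd

def strongest_pairs(df, n=3):
    """Return the n column pairs with the largest absolute Pearson correlation."""
    corr = df.select_dtypes("number").corr()
    # Keep the upper triangle only, so each pair appears once and the diagonal is dropped
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(n)

# Toy frame with made-up values; the text column is ignored
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 11],
    "c": [5, 3, 4, 1, 2],
    "region": ["x", "y", "x", "y", "x"],
})
top = strongest_pairs(toy, n=3)
print(top)
```

Applied to `df_plot`, this would return the same values that are annotated on the heatmap, sorted by strength.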
# Create a new column that categorizes GDP per capita
def gdp_bins(gdp):
    # Cascading lower bounds, so boundary values (e.g. exactly 10000) do not fall through to 'Low'
    if gdp >= 15000:
        return 'Very High'
    elif gdp >= 10000:
        return 'High'
    elif gdp >= 5000:
        return 'Medium'
    else:
        return 'Low'
merged_c['gdp_bins'] = merged_c['gdp_per_capita_usd_2019'].apply(gdp_bins)
print(merged_c.gdp_bins[:3])
0    Medium
1    Medium
2    Medium
Name: gdp_bins, dtype: object
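The same kind of binning can be written with `pd.cut`, which makes the bin edges explicit and avoids gaps between categories. This is a minimal sketch that assumes half-open intervals such as [5000, 10000); the sample values are illustrative and the edge convention is an assumption, not taken from the notebook.

```python
import numpy as np
import pandas as pd

# Assumed edges: Low < 5000 <= Medium < 10000 <= High < 15000 <= Very High
edges = [-np.inf, 5000, 10000, 15000, np.inf]
labels = ["Low", "Medium", "High", "Very High"]

# Illustrative GDP per capita values, including two boundary cases
gdp = pd.Series([411.6, 5000.0, 6432.4, 9999.5, 10261.7, 17276.5])

# right=False makes each interval closed on the left and open on the right
bins = pd.cut(gdp, bins=edges, labels=labels, right=False)
print(pd.DataFrame({"gdp_per_capita": gdp, "gdp_bin": bins}))
```

With `right=False`, a value of exactly 5000 lands in 'Medium' and 9999.5 stays in 'Medium' instead of dropping out of every range.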
fig_1 = px.histogram(merged_c, x = 'gdp_per_capita_usd_2019', nbins = 10
, title = 'Are countries with lower GDP per capita more likely to implement REDD+ projects?'
, color_discrete_sequence=['indianred'])
fig_1.show()
skewness_gdp = merged_c.gdp_per_capita_usd_2019.skew()
print('Skewness: ', skewness_gdp)
print('\n')
# Chi-square test
# Note: gdp_bins is derived from gdp_per_capita_usd_2019, so this contingency table
# compares the variable with its own bins
Ho = "GDP per capita has no effect on numbers of REDD+ projects" # Stating the Null Hypothesis
Ha = "GDP per capita has effect on numbers of REDD+ projects" # Stating the Alternate Hypothesis
crosstab = pd.crosstab(merged_c['gdp_per_capita_usd_2019'], merged_c['gdp_bins'])
# Contingency table
chi, p_value, dof, expected = stats.chi2_contingency(crosstab)
if p_value < 0.01: # Setting our significance level at 1%
    print(f'{Ha} as the p_value ({p_value.round(3)}) < 0.01')
else:
    print(f'{Ho} as the p_value ({p_value.round(3)}) > 0.01')
print('\n')
print("p_value is: ", p_value)
Skewness:  0.6160996708978211
GDP per capita has effect on numbers of REDD+ projects as the p_value (0.0) < 0.01
p_value is:  6.424738786483014e-276
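The contingency table above crosses gdp_per_capita_usd_2019 with gdp_bins, a column derived from that same variable, so a very small p-value is expected regardless of how many projects each country hosts. One possible alternative, which is not the method used in this notebook, is to work at country level and measure the rank correlation between GDP per capita and the number of projects. The sketch below uses only the ten countries visible in the country table earlier in this notebook (the five with the most projects and the five at the bottom), so its result is illustrative and says nothing about the full set of 64 countries.

```python
import pandas as pd
from scipy import stats

# Ten rows copied from the country-level table shown earlier (a non-random subset)
countries = pd.DataFrame({
    "country_name": ["Brazil", "Colombia", "Indonesia", "China", "Peru",
                     "Liberia", "Guinea-Bissau", "Egypt", "Congo",
                     "Venezuela, Bolivarian Republic of"],
    "gdp_per_capita_usd_2019": [8717.2, 6432.4, 4135.6, 10261.7, 6977.7,
                                621.9, 697.8, 3020.0, 2011.1, 3410.845],
    "count": [77, 57, 54, 48, 37, 1, 1, 1, 1, 1],
})

# Spearman's rank correlation does not assume a linear relationship or normality
rho, p_value = stats.spearmanr(countries["gdp_per_capita_usd_2019"], countries["count"])
print(f"Spearman rho = {rho:.3f}, p-value = {p_value:.4f}")
```

On the real data, the input would be the full country-level `table` built before the scatter plot, with one row per country.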
merged_c.columns
Index(['project_id', 'project_name', 'secondary_name', 'size_in_hectare',
'size_of_crediting_area_in_hectare', 'start_year', 'end_year',
'duration', 'project_description', 'objective_1', 'objective_2',
'objective_3', 'deforestation_drivers', 'project_type', 'id_country',
'size_of_pa_hectares', 'estimated_proportion_of_project_located_in_pa',
'status', 'region', 'country_name', 'hdi_2019', 'gdp_bln_usd_2019',
'gdp_per_capita_usd_2019', 'population (2019)', 'forest_area_2020_ha',
'forest_loss_2020', 'annual_deforestation_rate_2015_2020',
'index_of_government_effectiveness_2018',
'index_of_corruption_control_2018',
'participation_in_global_redd+_programs', 'forest_tenure_2015',
'comment', 'count', 'gdp_bins'],
dtype='object')
# Create a new column that categorizes forest area (forest_area_2020_ha)
def frst_bins(area):
    if 0 < area < 100000:
        return 'L1'
    elif 100000 <= area < 200000:
        return 'L2'
    elif 200000 <= area < 300000:
        return 'L3'
    elif 300000 <= area < 400000:
        return 'L4'
    else:
        return 'L5'
merged_c['forest_bins'] = merged_c['forest_area_2020_ha'].apply(frst_bins)
print(merged_c.forest_bins[:3])
0    L1
1    L1
2    L1
Name: forest_bins, dtype: object
fig_2 = px.histogram(merged_c, x = 'forest_area_2020_ha', nbins = 10
, title = "Is there any correlation between the size of a country's forest area and the number of REDD+ projects implemented?"
, color_discrete_sequence=['green'])
fig_2.show()
skewness_frst = merged_c.forest_area_2020_ha.skew()
print('Skewness: ', skewness_frst)
print('\n')
# Chi-square test
# The null hypothesis states "no effect"; the alternate states an effect
Ho = "Size of forest area has no effect on numbers of REDD+ projects" # Stating the Null Hypothesis
Ha = "Size of forest area has effect on numbers of REDD+ projects" # Stating the Alternate Hypothesis
crosstab = pd.crosstab(merged_c['forest_area_2020_ha'], merged_c['forest_bins'])
# Contingency table
chi, p_value, dof, expected = stats.chi2_contingency(crosstab)
if p_value < 0.01: # Setting our significance level at 1%
    print(f'{Ha} as the p_value ({p_value.round(3)}) < 0.01')
else:
    print(f'{Ho} as the p_value ({p_value.round(3)}) > 0.01')
print('\n')
print("p_value is: ", p_value)
Skewness:  1.7816248476554384
Size of forest area has effect on numbers of REDD+ projects as the p_value (0.0) < 0.01
p_value is:  6.424738786484476e-276
fig_3 = px.histogram(merged_c, x = 'annual_deforestation_rate_2015_2020', nbins = 10
, title = "Does forest loss have an impact on the number of REDD+ projects?"
, color_discrete_sequence=['blue'])
fig_3.show()
skewness_frstloss = merged_c.annual_deforestation_rate_2015_2020.skew()
print('Skewness: ', skewness_frstloss)
print('\n')
# Chi-square test
# The null hypothesis states "no effect"; the alternate states an effect
# Note: this crosstab pairs the deforestation rate with forest_bins (the forest-area categories)
Ho = "Annual forest loss rate has no effect on numbers of REDD+ projects" # Stating the Null Hypothesis
Ha = "Annual forest loss rate has effect on numbers of REDD+ projects" # Stating the Alternate Hypothesis
crosstab = pd.crosstab(merged_c['annual_deforestation_rate_2015_2020'], merged_c['forest_bins'])
# Contingency table
chi, p_value, dof, expected = stats.chi2_contingency(crosstab)
if p_value < 0.01: # Setting our significance level at 1%
    print(f'{Ha} as the p_value ({p_value.round(3)}) < 0.01')
else:
    print(f'{Ho} as the p_value ({p_value.round(3)}) > 0.01')
print('\n')
print("p_value is: ", p_value)
Skewness:  -0.7137900186619043
Annual forest loss rate has effect on numbers of REDD+ projects as the p_value (0.0) < 0.01
p_value is:  5.882485968225877e-279
In this data analysis exercise, I focused on REDD+ projects in relation to the economic parameters of the host country. To be continued...
[1] Griscom BW, Adams J, Ellis PW, Houghton RA, Lomax G, Miteva DA, Schlesinger WH, Shoch D, Siikamäki JV, Smith P, et al.: Natural climate solutions. Proc Natl Acad Sci 2017, 114:11645-11650.